
OCPBUGS-54188: Update Pod interactions with Topology Manager policies #95111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open

wants to merge 1 commit into base: main

Conversation

amolnar-rh
Contributor

@amolnar-rh commented Jun 23, 2025

Version(s): 4.12, 4.14, 4.15, 4.16, 4.17, 4.18, 4.19, 4.20

Issue: https://issues.redhat.com/browse/OCPBUGS-54188

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

@openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels Jun 23, 2025
@openshift-ci-robot

@amolnar-rh: This pull request references Jira Issue OCPBUGS-54188, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Version(s):

Issue:

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) label Jun 23, 2025

openshift-ci bot commented Jun 23, 2025

@amolnar-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/validate-portal
Commit: a9adc33
Details: link
Required: true
Rerun command: /test validate-portal

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

@amolnar-rh: This pull request references Jira Issue OCPBUGS-54188, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Version(s): 4.12, 4.14, 4.15, 4.16, 4.17, 4.18, 4.19, 4.20

Issue: https://issues.redhat.com/browse/OCPBUGS-54188

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@ffromani left a comment


some comments inside

@@ -32,9 +32,11 @@ spec:
memory: "100Mi"
----

If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.


Not sure here. When the Topology Manager policy is not `none`, it will indeed try to align all pods, but for pods whose QoS class is not Guaranteed, all the alignment logic degrades into a no-op. So, yes, we will go through the whole dance, but the result will be "no pinning, no alignment".
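As an illustrative sketch (the pod name, image, and values below are hypothetical), this is the kind of pod where the alignment logic ends up as a no-op: requests without matching limits put it in the Burstable QoS class.

----
# Hypothetical sketch: requests without matching limits give this pod the
# Burstable QoS class, so Topology Manager applies no NUMA pinning to it
# regardless of the configured policy.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-example               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # hypothetical image
    resources:
      requests:
        cpu: "1"
        memory: "100Mi"
# To become Guaranteed (and eligible for real alignment), every container
# would also need limits equal to its requests.
----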

@@ -32,9 +32,11 @@ spec:
memory: "100Mi"
----

If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.
When the Topology Manager policy is set to `none`, the relevant containers are pinned to any available CPU without considering NUMA affinity. This is the default behavior and does not optimize for performance-sensitive workloads.


We usually use "pinning" to mean "run on a precise set of resources", so I'm not sure the terminology is best here. "Pinned to anything" is something I don't see used much, but I'm also not a native English speaker.

Contributor Author

@amolnar-rh commented Jul 17, 2025


What about:

the relevant containers are assigned to run on any available set of CPUs...

Or should we keep it vague and, instead of specifying CPUs, say resources?


"the relevant containers are assigned to run on any available set of CPUs..." seems fine to me

If the selected policy is anything other than `none`, Topology Manager would not consider either of these `Pod` specifications.
If the selected policy is anything other than `none`, Topology Manager would consider either of the `BestEffort` or the `Burstable` QoS class `Pod` specifications.
When the Topology Manager policy is set to `none`, the relevant containers are pinned to any available CPU without considering NUMA affinity. This is the default behavior and does not optimize for performance-sensitive workloads.
Other values enable the use of topology awareness information from device plugins. The Topology Manager attempts to align the CPU, memory, and device allocations according to the topology of the node when the policy is set to other values than `none`. For more information about the available values, see _Additional resources_.


device plugins and core resources (cpu, memory)
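For context, a sketch of how the policy is usually selected on a node pool, assuming the standard KubeletConfig custom resource; the resource name and pool selector label below are hypothetical:

----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: topology-manager-example      # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled         # hypothetical label on the target MachineConfigPool
  kubeletConfig:
    cpuManagerPolicy: static          # exclusive CPU allocation is what makes pinning possible
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node   # or none, best-effort, restricted
----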

@@ -53,6 +55,6 @@ spec:
example.com/device: "1"
----

Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
Topology Manager would consider this pod. The Topology Manager would consult the Hint Providers, which are CPU Manager and Device Manager, to get topology hints for the pod.


CPU Manager, Device Manager and Memory Manager
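As an illustrative sketch (hypothetical name, image, and values), the kind of pod all of these hint providers would return hints for is a Guaranteed pod requesting CPU, memory, and a device:

----
apiVersion: v1
kind: Pod
metadata:
  name: numa-aligned-example                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # hypothetical image
    resources:
      # requests equal limits, so the pod has the Guaranteed QoS class
      requests:
        cpu: "2"
        memory: "1Gi"
        example.com/device: "1"
      limits:
        cpu: "2"
        memory: "1Gi"
        example.com/device: "1"
----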

@@ -16,15 +16,12 @@ This is the default policy and does not perform any topology alignment.

`best-effort` policy::

For each container in a pod with the `best-effort` topology management policy, kubelet calls each Hint Provider to discover their resource
availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
For each container in a pod with the `best-effort` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.


This is technically correct but maybe too low level. The observable behavior of the best-effort policy is that the kubelet will try to align all the required resources on a NUMA node, but if the allocation is impossible (not enough resources) the allocation will spill into other NUMA nodes unpredictably. The pod will always be admitted.

Contributor Author

@amolnar-rh commented Jul 17, 2025


I tried to rephrase it. WDYT?

Kubelet tries to align all the required resources on a NUMA node according to the preferred NUMA node affinity for that container. Even if the allocation is not possible due to insufficient resources, the Topology Manager still admits the pod, but the allocation is shared with other NUMA nodes.

Contributor Author


The only reason I'm leaving out "unpredictably" is that I feel we'd need to explain what that means exactly.


Your rephrasing seems fine to me, thanks
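As a hypothetical illustration of that behavior: on a node with, say, 4 allocatable CPUs per NUMA node, a Guaranteed pod requesting 6 exclusive CPUs cannot fit on a single node; under `best-effort` the pod is still admitted and its CPUs end up split across two NUMA nodes.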

For each container in a pod with the `restricted` topology management policy, kubelet calls each Hint Provider to discover their resource
availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, Topology Manager rejects this pod from the node, resulting in a pod in a `Terminated` state with a pod admission failure.
For each container in a pod with the `restricted` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a `Terminated` state with a pod admission failure.


The observable behavior here is that the kubelet will determine the theoretical minimal number of NUMA nodes that can fulfill the request, and reject the admission if the actual allocation would take more than that number of NUMA nodes; otherwise the pod will go running.

Contributor Author


What do you mean that the "pod will go running"? Do you mean that the pod is admitted and it will run/operate?

Except for that part, I rephrased it:

kubelet determines the theoretical minimum number of NUMA nodes that can fulfill the request. If the actual allocation requires more than that number of NUMA nodes, the Topology Manager rejects the admission, resulting in a pod in a Terminated state with a pod admission failure.


What do you mean that the "pod will go running"? Do you mean that the pod is admitted and it will run/operate?

yes, precisely.
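To continue the same hypothetical illustration: with 4 allocatable CPUs per NUMA node, a request for 6 exclusive CPUs has a theoretical minimum of 2 NUMA nodes, so under `restricted` the pod is admitted as long as the actual allocation spans at most 2 nodes; if fragmentation forced the allocation across 3 nodes, admission would be rejected.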


`single-numa-node` policy::

For each container in a pod with the `single-numa-node` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
For each container in a pod with the `single-numa-node` topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a `Terminated` state with a pod admission failure.


The observable behavior is that the kubelet will admit the pod iff all the resources required by the pod itself can be allocated on the same NUMA node. Arguably, it's the same as Restricted with a minimal number of NUMA nodes = 1.

Contributor Author


PTAL:

kubelet admits the pod if all the resources required by the pod can be allocated on the same NUMA node. If a single NUMA node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.


LGTM
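In that same hypothetical illustration, a pod whose 3 exclusive CPUs and single device are all free on one NUMA node is admitted under `single-numa-node`, while the 6-CPU pod is rejected because no single NUMA node can satisfy it.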
